DataFrame

A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. It is a distributed collection of data organized into rows and named columns, provides operations to filter, group, and compute aggregates, and can be used with Spark SQL. It is conceptually equivalent to a table in a relational database or to a data frame in R or Python. Spark DataFrames can be created from various sources, such as Hive tables, structured data files (for example, log files), external databases, or existing RDDs, and they allow the processing of huge amounts of data.
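As a rough sketch of how a DataFrame can be created from different sources, the following Scala snippet first builds a SparkSession and then creates DataFrames from a structured file, a Hive table, and an in-memory collection. The file path, table name, and sample data are hypothetical placeholders, not part of the original text.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("DataFrameSources")
      .enableHiveSupport()                      // only needed for the Hive example
      .getOrCreate()

    // From a structured data file (hypothetical path)
    val logsDf = spark.read.json("hdfs:///data/logs.json")

    // From a Hive table (hypothetical table name)
    val salesDf = spark.sql("SELECT * FROM sales")

    // From an in-memory collection, using the implicit conversions
    import spark.implicits._
    val peopleDf = Seq(("Alice", 30), ("Bob", 25)).toDF("name", "age")

    peopleDf.show()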
 


Why DataFrames?
  • When Apache Spark 1.3 was launched, it came with a new API called DataFrames that addressed the performance and scaling limitations encountered when using RDDs.
  • RDDs do not perform well when there is little storage space left in memory or on disk; once that space is exhausted, they degrade.
  • Spark RDDs have no concept of a schema, the structure that describes the data (for example, column names and types). RDDs store structured and unstructured data together, which is not very efficient.
  • RDDs give Spark no information it can use to run a job more efficiently, and they do not allow errors to be debugged at runtime. They store the data as a collection of Java objects.
  • RDDs rely on serialization (converting an object into a stream of bytes so it can be transferred or stored) and garbage collection (an automatic memory-management technique that detects unused objects and frees them from memory). Both are lengthy operations that increase the overhead on the system's memory.

This is why Spark DataFrames were introduced: to overcome the limitations that Spark RDDs had.
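To make the schema point concrete, here is a minimal Scala sketch. The `Employee` case class and its values are made up for illustration, and `spark` is assumed to be an existing SparkSession (as in the spark-shell). The RDD holds opaque JVM objects, while the DataFrame built from it carries column names and types that Spark can use for optimization.

    import spark.implicits._

    case class Employee(name: String, salary: Double)

    val rdd = spark.sparkContext.parallelize(
      Seq(Employee("Asha", 52000.0), Employee("Ravi", 61000.0)))

    // The RDD is just a collection of Employee objects; Spark cannot see inside them.
    val highPaidRdd = rdd.filter(_.salary > 55000.0)

    // Converting to a DataFrame exposes a schema (column names and types).
    val df = rdd.toDF()
    df.printSchema()                          // name: string, salary: double
    val highPaidDf = df.filter($"salary" > 55000.0)
    highPaidDf.show()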

  • Custom Memory Management: A lot of memory is saved because the data is stored off-heap in a binary format. There is no garbage-collection overhead, and expensive Java serialization is avoided, since the data is kept in binary form and its schema is known.
  • Optimized Execution Plan: Spark's query optimizer (Catalyst) creates an optimized execution plan for each query. Once the optimized plan is created, the final execution takes place on Spark RDDs.
  • Multiple Programming Languages: DataFrames in Spark can be used from R, Python, Scala, and Java, which makes it easier for programmers from different programming backgrounds to work with them.
  • Multiple Data Sources: DataFrames in Spark support a large variety of data sources and can process both structured and semi-structured data.
  • DataFrame APIs support slicing and dicing the data: operations such as select and filter can be applied to rows and columns (see the sketch after this list).
  • Statistical data is always prone to missing values, range violations, and irrelevant values. With DataFrames, the user can manage missing data explicitly.
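The points about slicing and dicing, the optimized plan, and missing values can be sketched in a few lines of Scala. The CSV path and column names below are hypothetical, and `spark` is again assumed to be an existing SparkSession.

    import spark.implicits._

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/measurements.csv")   // hypothetical file

    // Slice and dice: select columns and filter rows.
    val subset = df.select($"sensor", $"value").filter($"value" > 0)

    // Inspect the optimized execution plan produced by the query optimizer.
    subset.explain(true)

    // Manage missing data explicitly: drop rows with nulls, or fill defaults.
    val cleaned = df.na.drop()
    val filled  = df.na.fill(Map("value" -> 0.0))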
Features of Spark DataFrame

  • DataFrames in Spark are immutable in nature. Like Resilient Distributed Datasets, the data present in a DataFrame cannot be altered.
  • Lazy evaluation is the key to the remarkable performance offered by Spark. DataFrames in Spark will not produce any output unless an action operation is invoked.
  • The distributed, in-memory technique used to handle the data makes DataFrames fault tolerant.
  • Like Resilient Distributed Datasets, DataFrames in Spark follow the distributed memory model.
  • The only way to alter or modify the data in a DataFrame is by applying transformations, which produce new DataFrames (see the sketch after this list).
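As a small sketch of immutability and lazy evaluation, assuming a DataFrame `df` with a numeric "salary" column (as in the earlier example), transformations only describe new DataFrames; nothing runs until an action is called.

    import org.apache.spark.sql.functions.col

    // Transformations return new DataFrames; df itself is never modified.
    val withBonus   = df.withColumn("bonus", col("salary") * 0.1)
    val highEarners = withBonus.filter(col("salary") > 55000)

    // No job has run yet. An action such as show() or count() triggers execution.
    highEarners.show()

    // The original DataFrame is unchanged: it still has no "bonus" column.
    println(df.columns.mkString(", "))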
